Personal Recommendation via Modified Collaborative Filtering 
In this paper, we propose a novel method to compute the similarity between congeneric nodes in bipartite
networks. Different from the standard cosine similarity, we take into account the influence of a node's degree.
Substituting this new definition of similarity for the standard cosine similarity, we propose a modified
collaborative filtering (MCF). Based on a benchmark database, we demonstrate a great improvement in algorithmic
accuracy for both user-based MCF and object-based MCF.
I. INTRODUCTION



Recently, recommendation systems have been attracting more and more attention, because they can help users deal
with information overload, which is a great challenge in modern society, especially under the exponential growth
of the Internet [1] and the World-Wide-Web [2]. Recommendation algorithms have been used to recommend books and
CDs at Amazon.com, movies at Netflix.com, and news at VERSIFI Technologies (formerly AdaptiveInfo.com) [3]. The
simplest algorithm we can use in these systems is the global ranking method (GRM) [4], which sorts all the objects
in descending order of degree and recommends those with the highest degrees. GRM is not a personal algorithm, and
its accuracy is not very high because it does not take into account personal preferences. Accordingly, various
kinds of personal recommendation algorithms have been proposed, for example, collaborative filtering (CF) [5, 6],
content-based methods [7, 8], spectral analysis [9, 10], principal component analysis [11], the diffusion approach
[4, 12, 13, 14], and so on. However, the current generation of recommendation systems still requires further
improvements to make recommendation methods more effective [3]. For example, content analysis is practical only
if the items have well-defined attributes and those attributes can be extracted automatically; for some multimedia
data, such as audio/video streams and graphical images, content analysis is hard to apply. Collaborative filtering
usually provides very bad predictions/recommendations to new users who have very few collections. Spectral
analysis has high computational complexity and is thus infeasible for huge-size systems.

Thus far, the most widely applied personal recommendation algorithm is CF [3, 15]. CF has two categories in
general: one is user-based (U-CF), which recommends to the target user the objects collected by users sharing
similar tastes; the other is object-based (O-CF), which recommends objects similar to the ones the target user
preferred in the past. In this paper, we introduce a modified collaborative filtering (MCF), which can be
implemented for both the object-based and user-based cases and achieves much higher recommendation accuracy.



II. METHOD 



We assume that there is a recommendation system which consists of $m$ users and $n$ objects, and each user has
collected some objects. The relationship between users and objects can be described by a bipartite network. A
bipartite network is a particular class of networks [4, 16] whose nodes are divided into two sets, and connections
within one set are not allowed. We use one set to represent users and the other to represent objects: if an object
$o_i$ is collected by a user $u_j$, there is an edge between $o_i$ and $u_j$, and the corresponding element
$a_{ij}$ in the adjacency matrix $A$ is set to 1; otherwise it is 0.

In U-CF, the predicted score $v_{ij}$ (to what extent $u_i$ likes $o_j$) is given as:

$$v_{ij} = \sum_{l=1,\, l \neq i}^{m} s_{li}\, a_{jl}, \qquad (1)$$

where $s_{li}$ denotes the similarity between $u_l$ and $u_i$. For any user $u_i$, all $v_{ij}$ are ranked from
high to low, and the objects at the top that have not been collected by $u_i$ are recommended.
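
As an illustration, Eq. (1) and the ranking step can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `ucf_scores` and `recommend`, the list-of-lists layout of `A` (objects as rows, users as columns) and `S`, and the deterministic handling of ties are all choices made for this example.

```python
from typing import List


def ucf_scores(A: List[List[int]], S: List[List[float]], i: int) -> List[float]:
    """Predicted scores v_ij of user i for every object j, following Eq. (1).

    A[j][l] is 1 if object j was collected by user l (n objects x m users).
    S[l][i] is the similarity between users l and i (m x m); assumed given.
    """
    m = len(S)
    return [
        sum(S[l][i] * A[j][l] for l in range(m) if l != i)
        for j in range(len(A))
    ]


def recommend(A: List[List[int]], S: List[List[float]], i: int, length: int) -> List[int]:
    """Indices of the top-`length` objects that user i has not yet collected."""
    scores = ucf_scores(A, S, i)
    uncollected = [j for j in range(len(A)) if A[j][i] == 0]
    # Assumption: ties keep index order here, whereas the paper orders
    # equally scored objects at random.
    uncollected.sort(key=lambda j: -scores[j])
    return uncollected[:length]


# Toy data (hypothetical): 4 objects, 3 users.
A = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]]
S = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.0], [0.2, 0.0, 1.0]]
print(recommend(A, S, 0, 2))  # object 2 is ranked ahead of object 3 for user 0
```

Because only the relative order of the scores matters for a given user, the sketch does not normalize the sum.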

How to determine the similarity between users? The most common approach taken in previous works focuses on the
so-called structural equivalence. Two congeneric nodes (i.e., in the same set of a bipartite network) are
considered structurally equivalent if they share many common neighbors. The number of common objects shared by
users $u_i$ and $u_j$ is

$$c_{ij} = \sum_{l=1}^{n} a_{li}\, a_{lj}, \qquad (2)$$

which can be regarded as a rudimentary measure of $s_{ij}$. Generally, the similarity between $u_i$ and $u_j$
should be somewhat relative to their degrees [17]. At least three ways have previously been proposed to measure
similarity:

$$s_{ij} = \frac{2 c_{ij}}{k(u_i) + k(u_j)}, \qquad (3)$$

$$s_{ij} = \frac{c_{ij}}{\sqrt{k(u_i)\, k(u_j)}}, \qquad (4)$$

$$s_{ij} = \frac{c_{ij}}{\min(k(u_i), k(u_j))}, \qquad (5)$$

where $k(u_i)$ denotes the degree of user $u_i$. Eq. (3) is called Sorensen's index of similarity (SI) [18], which
was proposed by Sorensen in 1948; Eq. (4), called the cosine similarity, was proposed by Salton in 1983 and has a
long history in the study of citation networks [17]; Eq. (5) is called the Pearson correlation. Both Eq. (4) and
Eq. (5) are widely used in recommendation systems.
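
The common-neighbor count of Eq. (2) and the three indices of Eqs. (3)-(5) can be computed directly from the adjacency matrix. The following is a minimal sketch; the function names, the convention `A[l][i] = 1` when object `l` is collected by user `i`, and the choice to return 0 for users with zero degree are assumptions made for illustration.

```python
from math import sqrt
from typing import List

Matrix = List[List[int]]  # A[l][i] = 1 if object l was collected by user i


def user_degree(A: Matrix, i: int) -> int:
    """k(u_i): number of objects collected by user i."""
    return sum(row[i] for row in A)


def common_objects(A: Matrix, i: int, j: int) -> int:
    """c_ij of Eq. (2): number of objects collected by both users."""
    return sum(row[i] * row[j] for row in A)


def sorensen(A: Matrix, i: int, j: int) -> float:
    """Eq. (3)."""
    total = user_degree(A, i) + user_degree(A, j)
    return 2.0 * common_objects(A, i, j) / total if total else 0.0


def cosine(A: Matrix, i: int, j: int) -> float:
    """Eq. (4)."""
    product = user_degree(A, i) * user_degree(A, j)
    return common_objects(A, i, j) / sqrt(product) if product else 0.0


def min_degree_index(A: Matrix, i: int, j: int) -> float:
    """Eq. (5), the form the text calls the Pearson correlation."""
    smallest = min(user_degree(A, i), user_degree(A, j))
    return common_objects(A, i, j) / smallest if smallest else 0.0


# Toy data (hypothetical): 4 objects, 3 users.
A = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 0, 1]]
print(sorensen(A, 0, 1), cosine(A, 0, 1), min_degree_index(A, 0, 1))
```

In the toy data, users 0 and 1 have degrees 3 and 2 and share two objects, so the three indices differ only in how they normalize the same count $c_{01} = 2$.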

A common blemish of Eqs. (3)-(5) is that they do not take into account the influence of an object's degree, so
objects with different degrees make the same contribution to the similarity. If users $u_i$ and $u_j$ have both
selected object $o_l$, they have a similar taste for that object. If $o_l$ is very popular (the degree of $o_l$ is
very large), this taste (the favor for $o_l$) is a very ordinary one and does not mean that $u_i$ and $u_j$ are
very similar; therefore, its contribution to $s_{ij}$ should be small. On the other hand, if $o_l$ is very
unpopular (the degree of $o_l$ is very small), this taste is a peculiar one, so its contribution to $s_{ij}$
should be large. In other words, it is not very meaningful if two users both select a popular object, while if a
very unpopular object is simultaneously selected by two users, there must be some common tastes shared by these
two users. Accordingly, the contribution of object $o_l$ to the similarity $s_{ij}$ (if $u_i$ and $u_j$ both
collected $o_l$) should be negatively correlated with its degree $k(o_l)$. We suppose that the contribution of
$o_l$ to $s_{ij}$ is inversely proportional to $k^{\alpha}(o_l)$, with $\alpha$ a freely tunable parameter. The
similarity $s_{ij}$, consisting of the contributions of all commonly collected objects, is measured by the cosine
similarity as shown in Eq. (4). Therefore, the proposed similarity reads:

$$s_{ij} = \frac{1}{\sqrt{k(u_i)\, k(u_j)}} \sum_{l=1}^{n} \frac{a_{li}\, a_{lj}}{k^{\alpha}(o_l)}. \qquad (6)$$

Note that the influence of an object's degree can also be embedded into the other two forms, shown in Eq. (3) and
Eq. (5), and the corresponding algorithmic accuracies will be improved too. In this paper, we only show the
numerical results for the cosine similarity as a typical example.
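
The degree-weighted similarity of Eq. (6) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the name `modified_user_similarity`, the layout `A[l][i] = 1` when object `l` is collected by user `i`, and the zero returned for users with no collections are assumptions.

```python
from math import sqrt
from typing import List


def modified_user_similarity(A: List[List[int]], i: int, j: int, alpha: float) -> float:
    """Eq. (6): cosine similarity with each common object weighted by 1 / k(o_l)^alpha."""
    k_i = sum(row[i] for row in A)
    k_j = sum(row[j] for row in A)
    if k_i == 0 or k_j == 0:
        return 0.0  # assumption: users without collections are similar to no one
    total = 0.0
    for row in A:
        if row[i] and row[j]:
            # sum(row) is the degree k(o_l) of this object; it is at least 1 here.
            total += 1.0 / (sum(row) ** alpha)
    return total / sqrt(k_i * k_j)


# Toy data (hypothetical): 4 objects, 3 users.
A = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 0, 1]]
for alpha in (0.0, 1.0, 1.85):
    print(alpha, modified_user_similarity(A, 0, 1, alpha))
```

Setting `alpha = 0` recovers the standard cosine similarity of Eq. (4), while larger `alpha` shrinks the contribution of popular objects more strongly.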

For any user-object pair $u_i$-$o_j$, if $u_i$ has not yet collected $o_j$, the predicted score can be obtained by
using Eq. (1). Here we do not normalize Eq. (1), because normalization would not affect the recommendation list:
for a given target user, we need to sort all her uncollected objects, and only the relative magnitude is
meaningful. Note that if two objects have exactly the same score, their order is randomly assigned. We call this
method a modified collaborative filtering (U-MCF), for it belongs to the framework of U-CF.



III. NUMERICAL RESULTS 



FIG. 1: The degree distributions of users (left panel) and objects (right panel) in a linear-log plot, where
$P(k)$ denotes the cumulative degree distribution.

FIG. 2: The effect of the parameter $\alpha$ in U-MCF. The ranking score has its minimum at about $\alpha = 1.85$;
at almost the same point, the recall and precision achieve their maxima. The results are obtained by averaging
over four independent 90% vs. 10% divisions. The error bars denote the standard deviations.

Using a benchmark data set, namely MovieLens [19], we can evaluate the accuracy of the current algorithm. The data
consists of 1682 movies (objects) and 943 users. MovieLens is actually a rating system, where each user rates
movies on a discrete scale from 1 to 5. Hence we applied the coarse-graining method used in Refs. [4, 12]: a movie
is considered collected by a user if and only if the given rating is at least 3 (i.e., the user at least likes
this movie). The original data contains $10^5$ ratings, 85.25% of which are $\geq 3$; thus the data after coarse
graining contains 85250 user-object pairs. The degree distributions of users and objects are presented in Fig. 1.
Clearly, the degree distributions of both users and objects obey an exponential form. To test the recommendation
algorithms, the data set is randomly divided into two parts: the training set contains 90% of the data, and the
remaining 10% constitutes the probe. Of course, we can divide it in other proportions, for example, 80% vs. 20%,
70% vs. 30%, and so on. The training set is treated as known information, while no information in the probe set is
allowed to be used for prediction.
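
The coarse-graining step and the random training/probe division can be sketched as below. The function names, the `(user, object, rating)` tuple layout, and the seeded shuffle are assumptions introduced for this example; the text does not specify how the random division was drawn.

```python
import random
from typing import List, Tuple

Link = Tuple[int, int]  # (user index, object index)


def coarse_grain(ratings: List[Tuple[int, int, int]], threshold: int = 3) -> List[Link]:
    """Keep a user-object pair only if its rating is at least `threshold`."""
    return [(user, obj) for user, obj, rating in ratings if rating >= threshold]


def split_links(links: List[Link], probe_fraction: float = 0.1,
                seed: int = 0) -> Tuple[List[Link], List[Link]]:
    """Randomly divide the links into a training set and a probe set."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    shuffled = list(links)
    rng.shuffle(shuffled)
    n_probe = int(round(len(shuffled) * probe_fraction))
    return shuffled[n_probe:], shuffled[:n_probe]


# Hypothetical ratings: (user, object, rating on the 1-5 scale).
ratings = [(0, 0, 5), (0, 1, 2), (1, 0, 3), (1, 2, 1), (2, 1, 4)]
links = coarse_grain(ratings)
training, probe = split_links(links, probe_fraction=0.1)
print(len(links), len(training), len(probe))
```

Averaging over several independent divisions, as done for the reported results, would correspond to calling `split_links` with different seeds.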

A recommendation algorithm can provide each user with a recommendation list that contains all her/his uncollected
objects. There are several measures for evaluating the quality of the recommendation lists generated by different
algorithms. In this paper, we use ranking score, recall and precision to measure the effectiveness of a given
recommendation approach. A good overview of these measures can be found in Ref. [6].

FIG. 3: (Color online) (a) The predicted position of each entry in the probe, ranked in ascending order. (b) The
precision for different lengths of recommendation lists. (c) The recall for different lengths of recommendation
lists.

Ranking score. For an arbitrary user $u_i$, if the relation $u_i$-$o_j$ is in the probe set (according to the
training set, $o_j$ is an uncollected object for $u_i$), we measure the position of $o_j$ in the ordered queue.
For example, if there are 1000 uncollected movies for $u_i$, and $o_j$ is the 10th from the top, we say the
position of $o_j$ is the top 10/1000, denoted by $r_{ij} = 0.01$. Since the probe entries are actually collected
by users, a good algorithm is expected to rank them highly, thus leading to small $r$. Therefore, the mean value
of the position value $\langle r \rangle$ (called ranking score [4]), averaged over all the entries in the probe,
can be used to evaluate the algorithmic accuracy. The smaller the ranking score, the higher the algorithmic
accuracy, and vice versa. The definition of ranking score here is slightly different from that of Ref. [4]: if a
movie or user in the probe set has not yet appeared in the training set, we automatically remove it from the
probe, and the total number of movies is counted only for the ones that appear in the training set, whereas
Ref. [4] takes into account the movies that appear only in the probe by assigning zero score to them. This slight
difference in implementation does not affect the conclusion.
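
The ranking score defined above can be sketched as follows. The function name `ranking_score`, the data layout (`scores[i][j]` for the predicted score, `collected[i]` for the training-set collection of user `i`), and the index-order tie handling are assumptions for this illustration; the sketch also assumes that every probe entry refers to a user and object present in the training data.

```python
from typing import List, Sequence, Set, Tuple


def ranking_score(scores: Sequence[Sequence[float]],
                  collected: Sequence[Set[int]],
                  probe: List[Tuple[int, int]]) -> float:
    """Mean relative position <r> of the probe entries.

    scores[i][j]  : predicted score of object j for user i
    collected[i]  : objects collected by user i in the training set
    probe         : (user, object) pairs held out from training
    """
    positions = []
    for user, obj in probe:
        candidates = [o for o in range(len(scores[user])) if o not in collected[user]]
        # Highest score first; ties keep index order (the paper uses a random order).
        ranked = sorted(candidates, key=lambda o: -scores[user][o])
        positions.append((ranked.index(obj) + 1) / len(ranked))
    return sum(positions) / len(positions)


# Hypothetical scores for 2 users and 4 objects.
scores = [[0.9, 0.1, 0.5, 0.3], [0.2, 0.8, 0.4, 0.6]]
collected = [{0}, {1}]
print(ranking_score(scores, collected, [(0, 2), (0, 3)]))
```

With this convention, an object ranked 10th among 1000 uncollected objects receives the position 10/1000 = 0.01, matching the example in the text.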

Recall is defined as the ratio of the number of recommended objects that appear in the probe to the total number
of objects. A larger recall corresponds to better performance. Recall is also called hitting rate in the
literature [10].

Precision is defined as the ratio of the number of recommended objects that appear in the probe to the total
number of recommended objects. A larger precision corresponds to better performance. Recall and precision depend
on the length of the recommendation list $L$; we set $L = 50$ in our numerical experiment (in real e-commerce
systems, the length of the recommendation list usually ranges from 10 to 100), therefore the total number of
recommended objects is $mL = 47150$.
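
A possible way to compute the two measures for a list length $L$ is sketched below. The function name and data layout are hypothetical. Precision follows the definition above, with $mL$ as the denominator. For recall, the sketch assumes the denominator is the number of probe entries, which is one reading of the definition above and should be treated as an assumption.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(recommended: Dict[int, List[int]],
                     probe: Set[Tuple[int, int]],
                     length: int) -> Tuple[float, float]:
    """Precision and recall for recommendation lists truncated to `length`.

    recommended[i] : ranked list of uncollected objects for user i
    probe          : set of held-out (user, object) pairs
    """
    hits = sum(
        1
        for user, ranked in recommended.items()
        for obj in ranked[:length]
        if (user, obj) in probe
    )
    total_recommended = len(recommended) * length  # m * L, as in the text
    precision = hits / total_recommended
    recall = hits / len(probe)  # assumption: denominator is the probe size
    return precision, recall


# Hypothetical ranked lists for 2 users and a probe of 3 pairs.
recommended = {0: [2, 3, 1], 1: [3, 2, 0]}
probe = {(0, 3), (1, 0), (1, 3)}
print(precision_recall(recommended, probe, 2))
```

For the experiment described in the text, `len(recommended) * length` would equal 943 x 50 = 47150.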

Fig. 2 reports the algorithmic accuracy of U-MCF, which has a clear optimal case around $\alpha = 1.85$. Fig. 3(a)
reports the distribution of all the position values $r_{ij}$, which are sorted from the top position
($r_{ij} \to 0$) to the bottom position ($r_{ij} \to 1$). Fig. 3(b) and (c) report the precision and recall for
different lengths of recommendation lists, respectively. Fig. 4 reports the algorithmic accuracies of the standard
case ($\alpha = 0$) and the optimal case ($\alpha = 1.85$) for different sizes of training sets. All these
numerical results strongly demonstrate that depressing the contribution of commonly selected popular objects can
further improve the algorithmic accuracy.

Similar to U-CF, the recommendation list can also be obtained by object-based collaborative filtering (O-CF); that
is to say, the user will be recommended objects similar to the ones he/she preferred in the past [21]. By using
the cosine expression, the similarity between two objects, $o_i$ and $o_j$, can be written as:

$$s_{ij} = \frac{1}{\sqrt{k(o_i)\, k(o_j)}} \sum_{l=1}^{m} a_{il}\, a_{jl}. \qquad (7)$$

The predicted score, to what extent $u_i$ likes $o_j$, is given as:

$$v_{ij} = \sum_{l=1,\, l \neq j}^{n} s_{jl}\, a_{li}. \qquad (8)$$

Analogously, taking into account the influence of user degree, a modified expression of object-object similarity
reads:

$$s_{ij} = \frac{1}{\sqrt{k(o_i)\, k(o_j)}} \sum_{l=1}^{m} \frac{a_{il}\, a_{jl}}{k^{\alpha}(u_l)}, \qquad (9)$$



where $\alpha$ is a free parameter. The modified object-based collaborative filtering (O-MCF for short) can be
obtained by combining Eq. (8) and Eq. (9). Fig. 5 reports the algorithmic accuracy of O-MCF, which has a clear
optimal case around $\alpha = 0.95$. Fig. 6(a) reports the distribution of all the position values $r_{ij}$, which
are sorted from the top position ($r_{ij} \to 0$) to the bottom position ($r_{ij} \to 1$). Fig. 6(b) and (c)
report the precision and recall for different lengths of recommendation lists, respectively. Fig. 7 reports the
algorithmic accuracies of the standard case ($\alpha = 0$) and the optimal case ($\alpha = 0.95$) for different
sizes of training sets. All these results, again, demonstrate that depressing the contribution of users with high
degrees to the object-object similarity can further improve the algorithmic accuracy of the object-based method.

FIG. 4: (Color online) The standard CF (SCF) (i.e., $\alpha = 0$) vs. the optimal case for different sizes of
training sets.

FIG. 5: The effect of the parameter $\alpha$ in O-MCF. The ranking score has its minimum at about $\alpha = 0.95$;
at almost the same point, the recall and precision achieve their maxima. The results are obtained by averaging
over four independent 90% vs. 10% divisions. The error bars denote the standard deviations.

TABLE I: Three measures for different algorithms with the probe set containing 10% of the data. For precision and
recall, $L = 50$. The results are obtained by averaging over four independent divisions. The values corresponding
to U-MCF and O-MCF are the optimal ones.

Method    Ranking score    Precision    Recall
GRM       0.1502           0.3077       0.0540
O-CF      0.1173           0.4035       0.0706
U-CF      0.1252           0.3773       0.0660
O-MCF     0.1019           0.4443       0.0777
U-MCF                      0.4108       0.0719

FIG. 6: (Color online) Similar to Fig. 3, but for O-MCF.

FIG. 7: (Color online) Similar to Fig. 4, but for O-MCF.
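
The object-based variant, combining the modified similarity of Eq. (9) with the score of Eq. (8), can be sketched as below. The names `modified_object_similarity` and `omcf_scores` and the layout `A[i][l] = 1` when object `i` is collected by user `l` are assumptions for this example. The sketch recomputes similarities on the fly for clarity, whereas a practical implementation would precompute them.

```python
from math import sqrt
from typing import List


def modified_object_similarity(A: List[List[int]], i: int, j: int, alpha: float) -> float:
    """Eq. (9): each common user l contributes 1 / k(u_l)^alpha."""
    k_i, k_j = sum(A[i]), sum(A[j])
    if k_i == 0 or k_j == 0:
        return 0.0  # assumption: uncollected objects are similar to nothing
    m = len(A[0])
    total = 0.0
    for l in range(m):
        if A[i][l] and A[j][l]:
            k_user = sum(row[l] for row in A)  # degree k(u_l), at least 1 here
            total += 1.0 / (k_user ** alpha)
    return total / sqrt(k_i * k_j)


def omcf_scores(A: List[List[int]], user: int, alpha: float) -> List[float]:
    """Eq. (8): v_ij = sum over objects l != j of s_jl * a_li, for every object j."""
    n = len(A)
    return [
        sum(modified_object_similarity(A, j, l, alpha) * A[l][user]
            for l in range(n) if l != j)
        for j in range(n)
    ]


# Toy data (hypothetical): 4 objects (rows), 3 users (columns).
A = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 0, 1]]
print(omcf_scores(A, 2, 0.95))
```

Setting `alpha = 0` in this sketch reduces Eq. (9) to the cosine similarity of Eq. (7), which corresponds to the standard O-CF.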



IV. CONCLUSION



We compare the MCF, standard CF and GRM in Tab. I. Clearly, MCF is the best method and GRM performs worst.
Compared with the standard CF, the modified object-based algorithm and the modified user-based algorithm improve
the accuracy to different extents in all three measures. Ignoring the degree-degree correlation in user-object
relations, the algorithmic complexity of U-MCF is $O(m^2 \langle k_u \rangle + mn \langle k_o \rangle)$, and that
of O-MCF is $O(n^2 \langle k_o \rangle + mn \langle k_u \rangle)$. Here $\langle k_u \rangle$ and
$\langle k_o \rangle$ denote the average degrees of users and objects, respectively. Therefore, one can choose
either O-MCF or U-MCF according to the specific properties of the data source. For example, if the number of users
is much larger than the number of objects (i.e., $m \gg n$), O-MCF runs much faster. On the contrary, if
$n \gg m$, U-MCF runs faster. Furthermore, the remarkable improvement of algorithmic accuracy also indicates that
our definition of similarity is more reasonable than the traditional one.
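
The two complexity expressions can be compared numerically with a back-of-the-envelope sketch. The function names are hypothetical, constant factors are ignored, and the sizes are the MovieLens figures quoted in Section III; using all 85250 links rather than the 90% training set is an assumption made only for this example.

```python
def umcf_cost(m: int, n: int, k_u: float, k_o: float) -> float:
    """Leading-order operation count of U-MCF: m^2 <k_u> + m n <k_o>."""
    return m * m * k_u + m * n * k_o


def omcf_cost(m: int, n: int, k_u: float, k_o: float) -> float:
    """Leading-order operation count of O-MCF: n^2 <k_o> + m n <k_u>."""
    return n * n * k_o + m * n * k_u


m, n, links = 943, 1682, 85250   # users, objects, user-object pairs
k_u = links / m                  # average user degree
k_o = links / n                  # average object degree
ratio = umcf_cost(m, n, k_u, k_o) / omcf_cost(m, n, k_u, k_o)
print(ratio, m / n)
```

Because $m \langle k_u \rangle = n \langle k_o \rangle$ (both equal the number of links), the ratio of the two counts reduces to $m/n$, which is consistent with choosing O-MCF when $m \gg n$ and U-MCF when $n \gg m$.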